METHOD OF, AND SYSTEM FOR, SCANNING ELECTRONIC DOCUMENTS
WHICH CONTAIN LINKS TO EXTERNAL OBJECTS

Introduction 

The present invention relates to a method of, and system for, replacing external
links in electronic documents such as email with internal links. One use of this is to ensure
that email that attempts to bypass email content scanners no longer succeeds. Another use is
to reduce the effectiveness of web bugs.

Background

Content scanning can be carried out at a number of places in the passage of
electronic documents from one system to another. Taking email as an example, it may be
carried out by software operated by the user, e.g. incorporated in, or an adjunct to, his email
client, and it may be carried out on a mail server to which the user connects, over a LAN or
WAN, in order to retrieve email. Also, Internet Service Providers (ISPs) can carry out
content scanning as a value-added service on behalf of customers who, for example, then
retrieve their content-scanned email via a POP3 account or similar.

One trick which can be used to bypass email content scanners is to create an
email which just contains a link (such as an HTML hyperlink) to the undesirable or "nasty"
content. Such content may include viruses and other varieties of malware as well as potentially
offensive material such as pornographic images and text, spam and other material to which
the email recipient may not wish to be subjected. The content scanner sees only the link,
which is not suspicious, and the email is let through. However, when the email is viewed in the
email client, the object referred to may be brought in either automatically by the email client or
when the reader clicks on the link. Thus, the nasty object ends up on the user's desktop
without ever passing through the email content scanner.

It is possible for the content scanner to download the object by following the
link itself. It can then scan the object. However, this method is not foolproof - for instance,
the server delivering the object to the content scanner may be able to detect that the request is
from a content scanner and not from the end user. It may then serve up a different, innocent
object to be scanned. However, when the end user requests the object, they get the nasty one.

Summary of the Invention 

The present invention seeks to reduce or eliminate the problems of embedded
links in electronic documents and does so by having the content scanner attempt to follow a
link found in an electronic document and scan the object which is the target of the link. If the
object is found to be acceptable from the point of view of content-scanning criteria, it is
retrieved by the scanner and embedded in the electronic document and the link in the
electronic document is adjusted to point at the embedded object rather than the original; this
can then be delivered to the recipient without the possibility that the version received by the
recipient differs from the one originally scanned.

If the object is not found to be acceptable, one or more remedial actions may be
taken: for example, the link may be replaced by a non-functional link and/or a notice that the
original link has been removed and why; another possibility is that the electronic document
can be quarantined and an email or alert generated and sent to the intended recipient advising
him that this has been done and perhaps including a link via which he can retrieve it
nevertheless or delete it. The process of following links, scanning the linked object and
replacing it or not with an embedded copy and an adjusted link may be applied recursively.
An upper limit may be placed on the number of recursion levels, to stop the system getting
stuck in an infinite loop (e.g. because there are circular links) and to limit the
amount of time the processing will take.
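
The bounded recursion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `collect_objects`, `ScanRejected`, `max_depth` and the `toy_*` helpers are hypothetical, and an in-memory dictionary stands in for the internet. No record of visited links is kept, so a circular link is stopped only by the depth limit, as the text describes.

```python
from typing import Callable, Dict, List, Optional


class ScanRejected(Exception):
    """Raised when the document needs a remedial action instead of delivery."""


def collect_objects(
    links: List[str],
    fetch: Callable[[str], Optional[bytes]],
    find_links: Callable[[bytes], List[str]],
    is_acceptable: Callable[[bytes], bool],
    max_depth: int = 3,  # hypothetical upper limit on recursion levels
    depth: int = 0,
    collected: Optional[Dict[str, bytes]] = None,
) -> Dict[str, bytes]:
    """Follow links recursively and return the accepted objects keyed by URL."""
    collected = {} if collected is None else collected
    if links and depth >= max_depth:
        raise ScanRejected("nesting limit reached")
    for url in links:
        data = fetch(url)
        if data is None:
            raise ScanRejected(f"could not retrieve {url}")
        if not is_acceptable(data):
            raise ScanRejected(f"unacceptable content at {url}")
        collected[url] = data
        # Recurse into any links found inside the object just retrieved.
        collect_objects(find_links(data), fetch, find_links, is_acceptable,
                        max_depth, depth + 1, collected)
    return collected


# Toy stand-ins (assumptions for illustration only).
WEB = {
    "http://example.test/a": b"link:http://example.test/b",
    "http://example.test/b": b"plain image bytes",
    "http://example.test/loop": b"link:http://example.test/loop",  # circular
}


def toy_fetch(url: str) -> Optional[bytes]:
    return WEB.get(url)


def toy_find_links(data: bytes) -> List[str]:
    text = data.decode("ascii", "replace")
    return [text[len("link:"):]] if text.startswith("link:") else []


def toy_is_acceptable(data: bytes) -> bool:
    return b"virus" not in data
```

With these toy helpers, the link to `a` yields both `a` and `b`, while the self-referencing `loop` object raises `ScanRejected` once the depth limit is reached rather than looping forever.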



Thus according to the present invention there is provided a content scanning
system for electronic documents such as emails comprising:

a) a link analyser for identifying hyperlinks in document content;

b) means for causing a content scanner to scan objects referenced by links
identified by the link analyser and to determine their acceptability according to predefined
rules, the means being operative, when the link is to an object external to the document and is
determined by the content analyser to be acceptable, to retrieve the external object and modify
the document by

b1. embedding in it or attaching to it the retrieved copy of the object; and
b2. replacing the link to the external object by one to the copy embedded
in, or attached to, the document.

The invention also provides a method of content-scanning electronic
documents such as emails comprising:

a) using a link analyser for identifying hyperlinks in document content;

b) using a content scanner to scan objects referenced by links identified by
the link analyser and to determine their acceptability according to predefined rules, the means
being operative, when the link is to an object external to the document and is determined by
the content analyser to be acceptable, to retrieve the external object and modify the document
by

b1. embedding in it or attaching to it the retrieved copy of the object; and
b2. replacing the link to the external object by one to the copy embedded
in, or attached to, the document.

Thus the content scanner can follow the link, and download and scan the
object. If the object is judged satisfactory, the object can then be embedded in the email, and
the link to the external object replaced by a link to the object now embedded in the email.
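
As an illustration of the link analyser, the sketch below collects external links from an HTML body using Python's standard `html.parser` module. The class name `LinkAnalyser`, the function `find_external_links`, the choice of the `src` and `href` attributes, and the rule that "external" means an `http` or `https` URL are all assumptions made for this example; the specification does not prescribe them. Any transfer encoding (such as quoted-printable) is assumed to have been decoded beforehand.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urlparse


class LinkAnalyser(HTMLParser):
    """Collects attribute values that point outside the document."""

    LINK_ATTRS = {"src", "href"}          # assumed set of link-bearing attributes
    EXTERNAL_SCHEMES = {"http", "https"}  # assumed definition of "external"

    def __init__(self) -> None:
        super().__init__()
        self.external_links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        for name, value in attrs:
            if name in self.LINK_ATTRS and value:
                if urlparse(value).scheme in self.EXTERNAL_SCHEMES:
                    self.external_links.append(value)


def find_external_links(html: str) -> List[str]:
    """Return external links in document order; cid: and mailto: are skipped."""
    analyser = LinkAnalyser()
    analyser.feed(html)
    analyser.close()
    return analyser.external_links
```

Links that already use the `cid:` scheme refer to parts inside the email, so they are treated as internal and left alone.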

One trick used by spammers is to embody 'web bugs' in their spam emails.
These are unique or semi-unique links to web sites - so a spammer sending out 1000 emails
would use 1000 different links. When the email is read, a connection is made to the web site,
and by finding which link has been hit, the spammer can match it with their records to tell
which person has read the spam email. This then confirms that the email address is a genuine
one. The spammer can continue to send email to that address, or perhaps even sell the address
on to other spammers.

By following every external link in every email that passes through the content
scanner, all the web bugs the spammer sends out will be activated. Their effectiveness
therefore becomes much reduced, because they can no longer be used to tell which email
addresses were valid or not.
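
The reasoning can be made concrete with a toy model of the spammer's bookkeeping. Everything here is hypothetical (the function `confirmed_addresses`, the tokens and the addresses), and it assumes that all of the spammer's emails pass through the content scanner.

```python
from typing import Dict, Set


def confirmed_addresses(sent: Dict[str, str], hits: Set[str]) -> Set[str]:
    """Spammer's view: addresses whose unique link (token) was requested."""
    return {address for token, address in sent.items() if token in hits}


# One unique token per recipient (hypothetical data).
sent = {
    "t1": "alice@example.test",   # opens the email
    "t2": "bob@example.test",     # never opens the email
    "t3": "carol@example.test",   # never opens the email
}

# Without scanning, only the reader's token is hit.
hits_without_scanner = {"t1"}

# With scanning, the scanner follows every link, so every token is hit.
hits_with_scanner = set(sent)
```

In the first case the spammer learns that one address is live. In the second, every address appears "confirmed", so the hits no longer distinguish read from unread mail.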

The invention will be further described by way of non-limiting example with
reference to the accompanying drawings, in which:

Figure 1 shows the "before" and "after" states of an email processed by an
embodiment of the present invention; and

Figure 2 shows a system embodying the present invention.

Figure 1a shows an email 1 which comprises a header region 2 and a body 3
formatted according to an internet (e.g. SMTP/MIME) format. The body 3 includes a
hypertext link 4 which points to an object 5 on a web server 6 somewhere on the internet.
The object 5 may for example be a graphical image embedded in a web page (e.g. HTML or
XHTML).

Figure 1b shows the email 1 after processing by the illustrated embodiment of
the invention, and it will be seen that the object 5 has been appended to the email (e.g. as a
MIME attachment) as item 5' and the link 4 has been adjusted so that it now points to this
version of the object rather than the one held on the external server 6.




Figure 2 is an illustration of a system 10 according to the present invention,
which may be implemented as a software automaton. Although the invention is not limited to
this application, this example embodiment is given in terms of a content scanner operated by
an ISP to process an email stream, e.g. passing through an email gateway.


Operation of Embodiment

1) The email is analysed by analyser 20 to determine whether it contains 
external links. If none are found, omit steps 2 to 5. 

2) For each external link, the external object is obtained by object fetcher
30 from the internet. If the object cannot be obtained, go to step 7.

3) The external objects are scanned by analyser 40 for pornography,
viruses, spam and other undesirables. If any are found, go to step 7.

4) The external objects are analysed to see whether they contain external
links. If the nesting limit has been reached, go to step 7. Otherwise go to step 2 for each
external link.

5) The email is now rebuilt by email rebuilder 50. In the case of MIME
email, the external links are replaced with internal links, and the objects obtained are added to
the email as MIME sections. Non-MIME email is first converted to MIME email, and the
process then continues as before.

6) The email is sent on, and processing stops in respect of that email.

7) An undesirable object has been found, or the object could not be
retrieved, or the nesting limit has been reached. We may wish to block the email (processing
stops), or to remove the links. We may also want to send warning messages to sender and
recipient if the email has been blocked. Meanwhile the email may be held in quarantine as
indicated at 60, which may be implemented as a reserved file directory.
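
Steps 1 to 7 can be sketched as a single function driven by a work list. This is a hedged illustration only: `process_email`, `Outcome`, `Result`, `nesting_limit` and the `toy_*` helpers are hypothetical names, the callables stand in for analyser 20, object fetcher 30, analyser 40 and email rebuilder 50, and quarantine (60) is represented simply by the returned outcome. Counting the links in the email itself as nesting level 1 is also an assumption.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple


class Outcome(Enum):
    SENT = "sent"                # step 6
    QUARANTINED = "quarantined"  # step 7


@dataclass
class Result:
    outcome: Outcome
    body: str
    reason: str = ""


def process_email(
    body: str,
    find_links: Callable[[str], List[str]],            # analyser 20
    fetch: Callable[[str], Optional[bytes]],           # object fetcher 30
    is_clean: Callable[[bytes], bool],                 # analyser 40
    rebuild: Callable[[str, Dict[str, bytes]], str],   # email rebuilder 50
    nesting_limit: int = 2,
) -> Result:
    objects: Dict[str, bytes] = {}
    pending: List[Tuple[str, int]] = [(u, 1) for u in find_links(body)]  # step 1
    while pending:
        url, level = pending.pop()
        if level > nesting_limit:                                        # step 4
            return Result(Outcome.QUARANTINED, body, "nesting limit reached")
        data = fetch(url)                                                # step 2
        if data is None:
            return Result(Outcome.QUARANTINED, body, f"could not retrieve {url}")
        if not is_clean(data):                                           # step 3
            return Result(Outcome.QUARANTINED, body, f"unacceptable object at {url}")
        objects[url] = data
        # Step 4: queue links found inside the retrieved object.
        pending.extend((u, level + 1) for u in find_links(data.decode("latin-1")))
    if objects:
        body = rebuild(body, objects)                                    # step 5
    return Result(Outcome.SENT, body)                                    # step 6


# Toy stand-ins (assumptions for illustration only).
WEB = {
    "http://example.test/ok.gif": b"GIF89a",
    "http://example.test/bad": b"malware payload",
}


def toy_find_links(text: str) -> List[str]:
    return re.findall(r"https?://[^\s\"'<>]+", text)


def toy_fetch(url: str) -> Optional[bytes]:
    return WEB.get(url)


def toy_is_clean(data: bytes) -> bool:
    return b"malware" not in data


def toy_rebuild(body: str, objects: Dict[str, bytes]) -> str:
    for i, url in enumerate(objects):
        body = body.replace(url, f"cid:obj{i}")
    return body
```

The sketch quarantines on any failure; an implementation could equally remove the offending links or send warning messages, as step 7 allows.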



Example 



The following email contains a link to a website. 



Subject: email with link 
Subject: 

Date: Thu, 9 May 2002 16:17:01 +0600 
MIME-Version: 1.0 
Content-Type: text/html;
Content-Transfer-Encoding: 7bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV>  </DIV>
This is some text<BR>
<DIV><IMAGE src="http://www.messagelabs.com/images/global/nav/box-images/virus-eye-light.gif">
</DIV>
This is some more text<BR>
</BODY></HTML>

The binary content of http://www.messagelabs.com/images/global/nav/box-images/virus-eye-light.gif
is as follows:

This file can be downloaded, scanned, and if acceptable, a new email can be created with the
image embedded in the email:

Subject: email with link
Subject:

Date: Thu, 9 May 2002 16:17:01 +0600
MIME-Version: 1.0
Content-Type: multipart/related;
    boundary="ABCD";
Content-Transfer-Encoding: 7bit

--ABCD
Content-Type: text/html;
Content-Transfer-Encoding: 7bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV>  </DIV>
This is some text<BR>
<DIV><IMAGE src=cid:EXTERNAL>
</DIV>
This is some more text<BR>
</BODY></HTML>

--ABCD
Content-ID: <EXTERNAL>
Content-Type: image/gif;
    name="image001.gif"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
    filename="image001.gif"

R01GODlhFwAXAMQAAICAgGRWBAAAAFJSU8irBP39/f/yAKqQBP/OAAshVzQrA8bGx7Cuq412BRYV
FxgUAiYnLMaxTL2+w+/IiAyQdAgMLHgAqhEc8AwwKBZONcqGfl+Lh4dvh9ti6A//tAeS+HiH5BAAA
AAAALAAAAAAXABcAAAX2ICCOZGmOSKqu6mQg74uI7PoSTXNOBP/SNcOkcwg8FIpirgdcDTtIjEDw
uCAbgUMzZSBcGtOwmHIJiAzoV+ciboelogPW+nDbpyLpXeDQtx9xDwEUbwMZCxoOU3pJHSIBAWB8
ABIFBRIDAwIJAwAYFAEEjxllEAuWlgMQmhYJG5oCFyITYBinqAWwFaOMEFOAABOEA7iWDA4JFhYV
DoJlcVMZxZYIiAxW/Bx5DImwCDLgbEhmqAhgBHTIGIsMLAJkQfZ91BGgrccQZkJA6BC71LGcicIiQ
pmANewAQfNgQ4aDDFDQ+FNCA7mGNiJcCyLCo4oTHjyEAADs=

--ABCD--
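
A rebuilt message of this shape can be produced with Python's standard `email` package. The sketch below is an assumption-laden illustration rather than the rebuilder of the embodiment: the function name `embed_object` is hypothetical, the subject, content identifier and file name are copied from the example above, and the library chooses its own boundary string and header layout, so the output will not match the listing byte for byte.

```python
from email.message import EmailMessage


def embed_object(
    html: str,
    external_url: str,
    data: bytes,
    cid: str = "EXTERNAL",            # content identifier used in the example
    filename: str = "image001.gif",   # file name used in the example
) -> EmailMessage:
    """Return a multipart/related email whose link points at an embedded copy."""
    msg = EmailMessage()
    msg["Subject"] = "email with link"
    # Replace the external link with an internal cid: link.
    msg.set_content(html.replace(external_url, f"cid:{cid}"), subtype="html")
    # Adding a related part converts the message to multipart/related and
    # base64-encodes the binary object.
    msg.add_related(
        data,
        maintype="image",
        subtype="gif",
        cid=f"<{cid}>",
        filename=filename,
    )
    return msg
```

Calling `embed_object(html, url, gif_bytes).as_string()` gives the text of the rebuilt email, with the HTML part first and the image part second.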

CLAIMS

1. A content scanning system for electronic documents such as emails
comprising:

a) a link analyser for identifying hyperlinks in document content;

b) means for causing a content scanner to scan objects referenced by links
identified by the link analyser and to determine their acceptability according to predefined
rules, the means being operative, when the link is to an object external to the document and is
determined by the content analyser to be acceptable, to retrieve the external object and modify
the document by

b1. embedding in it or attaching to it the retrieved copy of the object; and
b2. replacing the link to the external object by one to the copy embedded in,
or attached to, the document.

2. A system according to claim 1 wherein the link analyser a) and means b) are
operative to recursively process links identified in such external objects.

3. A system according to claim 2 in which only a maximum depth of recursion is
permitted and the document is flagged as unacceptable if that limit is reached.

4. A system according to claim 1, 2 or 3 wherein acceptable retrieved objects are
encoded into MIME format.







5. A system according to any one of the preceding claims wherein if any linked-to
object is determined by the content scanner to be unacceptable the document is flagged or
modified to indicate that fact.

6. A method of content-scanning electronic documents such as emails
comprising:

a) using a link analyser for identifying hyperlinks in document content;

b) using a content scanner to scan objects referenced by links identified by
the link analyser and to determine their acceptability according to predefined rules, the means
being operative, when the link is to an object external to the document and is determined by
the content analyser to be acceptable, to retrieve the external object and modify the document
by

b1. embedding in it or attaching to it the retrieved copy of the object; and
b2. replacing the link to the external object by one to the copy embedded in,
or attached to, the document.

7. A method according to claim 6 wherein the steps a) and b) are used recursively
to process links identified in such external objects.

8. A method according to claim 7 in which only a maximum depth of recursion is
permitted and the document is flagged as unacceptable if that limit is reached.

9. A system according to claim 6, 7 or 8 wherein acceptable retrieved objects are
encoded into MIME format.




10. A method according to any one of claims 6 to 9, wherein if any linked-to object
is determined by the content scanner to be unacceptable the document is flagged or modified
to indicate that fact.

11. A content scanning system for electronic documents substantially as hereinbefore
described and with reference to the accompanying drawings.

12. A method of content-scanning electronic documents substantially as hereinbefore 
described and with reference to the accompanying drawings. 

ABSTRACT

A content scanner for electronic documents such as email scans objects which 
are the target of hyperlinks within the document. If they are determined to be acceptable, a 
copy of the object is attached to the document and the link is replaced by one pointing to the 
copied object. 